After running the demultiplexer_for_dada2 (http://github.com/ramongallego/demultiplexer_for_dada2), we have to denoise the whole dataset. We will do this by using 4 different processes:
Estimation of Tag-jumping or indices cross-talk . We run multiple samples on each MiSeq run. These are identified by two sets of molecular barcodes. There is the potential of some sequences to be assigned to the wrong sample, which is a bummer. To estimate how many reads did this, on each MiSeq run we added some samples whose composition is known and extremely unlikely to be present in the enviromental samples studied. AS a result of this Tag-jumping, some of the positive control sequences might show in the environmental samples and viceversa. In our case, these positive controls are made of either Kangaroo or Ostrich (and Alligator). The process consists on, for each run, to model the compositon observed on the positive controls and substract it from the environmental samples from that run. The output will be a dataset with the same number of samples as before, but with fewer reads of certain sequences (ASVs)
Discarding samples with extremely low number of reads. Sometimes the number of reads sequenced from a particular replicate are really low, and hence the relative proportions of ASVs would be skewed.
Full clearance from Positive control influence. THis process also takes advantage of the known composition of the positive controls. Each ASV found in the positive controls with a higher abundace in them than in the rest of the samples will be labelled as Positive and removed from the environmental dataset. The output will be a dataset with the same number of samples as before but with fewer ASVs.
Occupancy modelling . Is the presence of a ASV a reflection of a biological reality or likely a PCR artifact? This may seem trivial in extreme cases (an ASV that only appears in one PCR replicate in the whole dataset) but how to discriminate between PCR artifacts from rare but real organisms? We use Occupancy modelling to determine if the pattern of presence of a ASV in a dataset reflects that. The output of this procedure will be a datasetwith the same number of samples as before but with fewer ASVs.
Dissimilarity between PCR replicates. The workflow that leads to the sequencing of a particular sample is subject to many stochatic processes, and it is not unlikely that the composition retrieved is very different for the original community. A way to ensure that this difference is minimal is through the separate analysis of each PCR replicate. We used that approach and modeled the dissimilarity between each PCr replicate and the group centroid. This way of modeling the dissimilarity allows us to discard those PCR replicate that won’t fit the normal distribution of dissimilarities. The output of this procedure will be a dataset with the same number of Hashes as before but with fewer samples.
As with everything, we will start the process by loading the required packages and datasets.
We will load the ASV table and the metadata file. They are in the same folder so we use list.files to access them and a neat combination of bind.rows and map(read_csv).
Hash.key <- bind_rows(map(all.hashes, read_csv))
Parsed with column specification:
cols(
Hash = [31mcol_character()[39m,
Sequence = [31mcol_character()[39m
)
Parsed with column specification:
cols(
Hash = [31mcol_character()[39m,
Sequence = [31mcol_character()[39m
)
Parsed with column specification:
cols(
Hash = [31mcol_character()[39m,
Sequence = [31mcol_character()[39m
)
Parsed with column specification:
cols(
Hash = [31mcol_character()[39m,
Sequence = [31mcol_character()[39m
)
Parsed with column specification:
cols(
Hash = [31mcol_character()[39m,
Sequence = [31mcol_character()[39m
)
Parsed with column specification:
cols(
Hash = [31mcol_character()[39m,
Sequence = [31mcol_character()[39m
)
Parsed with column specification:
cols(
Hash = [31mcol_character()[39m,
Sequence = [31mcol_character()[39m
)
Parsed with column specification:
cols(
Hash = [31mcol_character()[39m,
Sequence = [31mcol_character()[39m
)
A few things we check for: That no sample appears twice in the metadata. That the metadata uses Tag_01 instead of Tag_1 (so it can be sorted alphabetically). That the structure Site_YYYYMM[A-C].[1-3] is the same across the dataset.
# Check that no sample appears more than once in the metadata
metadata %>%
group_by(sample_id) %>%
summarise(tot = n()) %>%
arrange(desc(tot)) # Samples only appear once
# We should change Tag_1 for Tag_01
metadata %>%
mutate(Tag = case_when(str_detect(Tag, "\\_[0-9]{1}$") ~ str_replace(Tag, "Tag_", "Tag_0"),
TRUE ~ Tag )) -> metadata
metadata %>% mutate(x= str_count(sample_id, pattern = "_")) %>% arrange(desc(x)) # Lilliwaup 201710 has 2 underscores, run 3 metadata has no underscore
ASV.table %>% distinct(sample) %>% mutate(x= str_count(sample, pattern = "_")) %>% arrange(desc(x)) # Also in the ASV table
# Fix it here
metadata %>% mutate(x= str_count(sample_id, pattern = "_")) %>%
mutate(sample_id = case_when(x == 2 ~ str_replace(sample_id, pattern = "_B", replacement = "B"),
TRUE ~ sample_id)) %>%
dplyr::select(-x) -> metadata
ASV.table %>% mutate(x= str_count(sample, pattern = "_")) %>%
mutate(sample = case_when(x == 2 ~ str_replace(sample, pattern = "_B", replacement = "B"),
TRUE ~ sample)) %>%
dplyr::select(-x) -> ASV.table
# Done
# Change Kangaroo.4 for Kangaroo.3. Also Change Ostrich1 Ostrich2 and Ostrich3 for .1, .2 .3
metadata %>%
mutate (sample_id = case_when (str_detect(sample_id, "Ostrich[123]") ~ str_replace(sample_id, "Ostrich", "Ostrich\\."),
sample_id == "Kangaroo.4" ~ "Kangaroo.3",
TRUE ~ sample_id)) -> metadata
ASV.table %>%
mutate (sample = case_when (str_detect(sample, "Ostrich[123]") ~ str_replace(sample, "Ostrich", "Ostrich\\."),
sample == "Kangaroo.4" ~ "Kangaroo.3",
TRUE ~ sample)) -> ASV.table
# DONE
# Add underscore to run3
ASV.table %>%
filter(str_detect(sample, "[A-Z][A-Z]2017")) %>%
distinct(Miseq_run) # it only affects one run
ASV.table %>%
mutate (sample = case_when(str_detect(sample, "[A-Z][A-Z]2017") ~ str_replace(sample,"2017", "_2017" ),
TRUE ~ sample)) -> ASV.table
# ASV table from Run 1 has the full date instead of just YYYYMM
ASV.table %>%
filter(str_detect(sample, "_[:digit:]{8}")) # YES
ASV.table %>%
mutate(sample = case_when(str_detect(sample, "_[:digit:]{8}") ~ paste0(str_replace(sample, "_[:digit:]{8}", str_extract(sample,"_[:digit:]{6}" ))),
TRUE ~ sample)) -> ASV.table
# in MiSeqRun1, there are duplicated entries of some Hash / sample combos, probably bc we combined the Miseq and MiSeq nano runs
ASV.table %>%
group_by(Miseq_run, sample, Hash) %>%
summarise(nReads = sum(nReads)) %>%
ungroup -> ASV.table
The output of this process are a clean ASV table and a clean metadata file.
Before we modify our datasets on any way, we can calculate how many sequences that were only supposed to be in the positives control appeared in the environmental samples, and how many did the opposite. First we divide the dataset into positive control and environmental samples. Also create an ordered list of the Hashes present in the positive controls, for ease of plotting
ASV.table %>% mutate(source = case_when(str_detect(sample, "Kangaroo|K\\+|k\\+|Ostrich") ~ "Positives",
TRUE ~ "Samples")) -> ASV.table
ASV.table %>%
filter (source == "Positives") %>%
group_by(Hash) %>%
summarise(tot = sum(nReads)) %>%
arrange(desc(tot)) %>%
pull(Hash) -> good.order
Now let’s create a jumping vector. What proportion of the reads found in the positives control come from elsewhere, and what proportion of the reads in the samples come from the positives control.
To streamline the process and make it easier to execute it similarly but independently on each Miseq run, we nest the dataset by run. So Step1 is create a nested table so we can run this analysis on each run independently.
That wasn’t too complicated. Let’s start a summary function that keeps track of our cleaning process
We create a vector of the composition of each positive control and substract it from the environmental samples from their runs
The idea behind this procedure is that we know, for each run, how many reads from each Hash appeared in teh positive controls. These come from 2 processes: sequences we know should appear in the positive controls, and sequences that have jumped from the environment to the positive controls. With this procedure, we substract from every environmental sample the proportion of reads that jumped from elsewhere.
ASV.nested %>%
mutate(cleaned.tibble = map2(Samples, contam.tibble, function(.x,.y){
.x %>%
group_by (sample) %>%
mutate (TotalReadsperSample = sum (nReads)) %>%
left_join(.y, by = "Hash") %>%
mutate (Updated_nReads = ifelse (!is.na(vector_contamination), nReads - (ceiling(vector_contamination*TotalReadsperSample)), nReads)) %>%
filter (Updated_nReads > 0) %>%
ungroup() %>%
dplyr::select (sample, Hash, nReads = Updated_nReads)
})) -> ASV.nested
ASV.nested %>%
group_by(Miseq_run) %>%
select(cleaned.tibble) %>%
unnest(cleaned.tibble) #Check how they look
Adding missing grouping variables: `Miseq_run`
Add this step to the summary table we were creating
ASV.nested %>%
transmute(Miseq_run, Summary.1 = map(cleaned.tibble, ~ how.many(ASVtable = .,round = "1.Jump"))) %>%
left_join(ASV.summary) %>%
mutate(Summary = map2(Summary, Summary.1, bind_rows)) %>%
dplyr::select(-Summary.1) -> ASV.summary
Joining, by = "Miseq_run"
We will fit the number of reads assigned to each sample to a normal distribution and discard those samples with a probability of 95% of not fitting in that distribution. The output would be a dataset with less samples and potentially less number of unique Hashes.
ASV.nested %>%
transmute(Miseq_run, Summary.1 = map(Step.1.low.reads, ~ how.many(ASVtable = .,round = "2.Low.nReads"))) %>%
left_join(ASV.summary) %>%
mutate(Summary = map2(Summary, Summary.1, bind_rows)) %>%
dplyr::select(-Summary.1) -> ASV.summary
Joining, by = "Miseq_run"
Removing the Hashes that belong to the positive controls. First, for each Hash that appeared in the positive controls, determine whether a sequence is a true positive or a true environment. For each Hash, we will calculate, maximum, mean and total number of reads in both positive and samples, and then we will use the following decission tree:
If all three statistics are higher in one of the groups, we will label it either of Environmental or Positive control influence.
If there are conflicting results, we will use the Hashes. to see if they belong to either the maximum abundance of a Hash is in a positive, then it is a positive, otherwise is a real sequence from the environment.
Now, for each Hash in each set of positives controls, calculate the proportion of reads that were missasigned - they appeared somewhere they were not expected. We will divide that process in two: first . A second step would be to create a column named proportion switched, which states the proportion of reads from one Hash that jumped from the environment to a positive control or viceversa. The idea is that any presence below a threshold can be arguably belong to tag jumping.
ASV.table %>%
group_by(source, Hash) %>%
summarise(ocurrences =n()) %>%
spread(key = source, value = ocurrences, fill = 0) %>%
#left_join(Hashes.to.remove.step2) %>%
#mutate(origin = case_when(is.na(origin) ~ "Kept",
# TRUE ~ "Discarded")) %>%
mutate(second.origin = case_when(Positives >= Samples ~ "Discarded",
TRUE ~ "Kept")) %>%
filter(second.origin == "Discarded") %>%
full_join(Hashes.to.remove.step2) -> Hashes.to.remove.step2
Joining, by = "Hash"
IN order to train DADA2 to better distinguish when positive control sequences have arrived in the environment, we will keep the sequences in a csv file
Hashes.to.remove.step2 %>%
left_join(Hash.key) %>%
select(Hash, Sequence) %>%
write_csv("Hashes.to.remove.csv")
Joining, by = "Hash"
ASV.summary %>%
unnest()
`cols` is now required.
Please use `cols = c(Summary)`
What is the probabilty of a true positive presence of a Hash in a Miseq Run. We will use eDNA occupancy modeling to asses whether a hash is a rare variant that spilled out or a true presence.
The process requires to load extra packages, create some model file, and group the hashes by Run, and biological replicate, summarising the data in a presence absence format.
The occupancy model itself was performed in the Rmarkdown file Rjags.tunning.Rmd, so here we will upload the csv file that contains all probability of occurences of all hashes per site. Each Hash-Site combination produces a matrix of presence abascences that feeds the model - for some cases it is a 30x3 matrix, for others it is a 39x3. We summarised the number of occurences in each case and run models for each unique case (to save computing time). Each unique model was run 10 times to filter out cases in which the model converge into a local maxima.
So we will import the object Occ.fate.csv and reduce the dataset to those Hashes with an occ > 0.8
So we will throw away most of the Hashes, but will keep most of the reads - we are getting into something here
occ.results %>%
filter(real > 0.8) %>%
pull (Hash) -> to.keep
ASV.nested %>%
mutate(Step3.tibble = map (Step2.tibble, ~ filter(.,Hash %in% to.keep))) -> ASV.nested
ASV.nested %>%
transmute(Miseq_run, Summary.1 = map(Step3.tibble, ~ how.many(ASVtable = .,round = "4.Occupancy"))) %>%
left_join(ASV.summary) %>%
mutate(Summary = map2(Summary, Summary.1, bind_rows)) %>%
dplyr::select(-Summary.1) -> ASV.summary
Joining, by = "Miseq_run"
So, a second way of cleaning the dataset is to remove samples for which the dissimilarity between PCR replicates exceeds the normal distribution of dissimilarities. Sometimes the preparation of a PCR replicate goes wrong for a number of reasons - that leads to a particular PCR replicate to be substantially different to the other 2. In that case, we will remove the PCR replicate that has higher dissimilarity with the other two.
The process starts by adding the biological information to the ASV table, then diving the dataset by their biological replicate. This will also remove any sample that is not included in the metadata, eg coming from a different project.
ASV.nested %>%
select(Miseq_run, Step3.tibble) %>%
unnest(Step3.tibble) %>%
separate(sample, into = "original_sample", sep = "\\.", remove = F) -> cleaned.tibble
Expected 1 pieces. Additional pieces discarded in 102752 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Ok, so there are 13 samples for which we only have 2 PCR replicates1. We will get rid of those with only 1, as we can’t estimate the PCR bias there.
Anyway, let’s have a visual representation of the dissimilarities between PCR replicates, biological replicates and everything else.
cleaned.tibble %>%
group_by (sample) %>%
mutate (Tot = sum(nReads),
Row.sums = nReads / Tot) %>%
group_by (Hash) %>%
mutate (Colmax = max (Row.sums),
Normalized.reads = Row.sums / Colmax) -> cleaned.tibble
tibble_to_matrix <- function (tb) {
tb %>%
group_by(sample, Hash) %>%
summarise(nReads = sum(Normalized.reads)) %>%
spread ( key = "Hash", value = "nReads", fill = 0) -> matrix_1
samples <- pull (matrix_1, sample)
matrix_1 %>%
ungroup() %>%
dplyr::select ( - sample) -> matrix_1
data.matrix(matrix_1) -> matrix_1
dimnames(matrix_1)[[1]] <- samples
vegdist(matrix_1) -> matrix_1
}
tibble_to_matrix (cleaned.tibble) -> all.distances.full
#names(all.distances.full)
summary(is.na(names(all.distances.full)))
Mode FALSE
logical 770
Let’s make the pairwaise distances a long table
as.tibble(subset(melt(as.matrix(all.distances.full)))) -> all.distances.melted
`as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
[90mThis warning is displayed once per session.[39m
summary(is.na(all.distances.melted$value))
Mode FALSE
logical 592900
# Now, create a three variables for all distances, they could be PCR replicates, BIOL replicates, or from the same site
all.distances.melted %>%
separate (Var1, into = "Bottle1", sep = "\\.", remove = FALSE) %>%
separate (Bottle1, into = "Site1", remove = FALSE) %>%
separate (Var2, into ="Bottle2", sep = "\\.", remove = FALSE) %>%
separate (Bottle2, into = "Site2", remove = FALSE) %>%
mutate ( Day.site1 = str_sub(Bottle1, start = 1, end = -2),
Day.site2 = str_sub(Bottle2, start = 1, end = -2),
Distance.type = case_when( Bottle1 == Bottle2 ~ "PCR.replicates",
Day.site1 == Day.site2 ~ "Biol.replicates",
Site1 == Site2 ~ "Same Site",
TRUE ~ "Different Site"
)) %>%
dplyr::select(Sample1 = Var1, Sample2 = Var2 , value , Distance.type) %>%
filter (Sample1 != Sample2) -> all.distances.to.plot
Expected 1 pieces. Additional pieces discarded in 592900 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].Expected 1 pieces. Additional pieces discarded in 592900 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].Expected 1 pieces. Additional pieces discarded in 592900 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].Expected 1 pieces. Additional pieces discarded in 592900 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
# Checking all went well
sapply(all.distances.to.plot, function(x) summary(is.na(x)))
Sample1 Sample2 value Distance.type
Mode "logical" "logical" "logical" "logical"
FALSE "592130" "592130" "592130" "592130"
all.distances.to.plot$Distance.type <- all.distances.to.plot$Distance.type %>% fct_relevel( "PCR.replicates", "Biol.replicates", "Same Site")
ggplot (all.distances.to.plot , aes (fill = Distance.type, x = value)) +
geom_histogram (position = "dodge", stat = 'density', alpha = 0.9) +
# facet_wrap( ~ Distance.type) +
labs (x = "Pairwise dissimilarity", y = "density" ,
Distance.type = "Distance")
Ignoring unknown parameters: binwidth, bins, pad
NA
# Instead of chosing based on the pw distances, we can do a similar thing using the distance to centroid
# Find out which samples have only two pcr replicates
cleaned.tibble %>% dplyr::select(-Miseq_run) %>% group_by(original_sample) %>% nest() -> nested.cleaning
nested.cleaning %>%
mutate(matrix = map(data, tibble_to_matrix)) -> nested.cleaning
nested.cleaning %>% mutate(ncomparisons = map(matrix, length)) -> nested.cleaning
dist_to_centroid <- function (x,y) {
biol <- rep(y, length(x))
if (length(biol) == 1) {
output = rep(x[1]/2,2)
names(output) <- attr(x, "Labels")
}else{
dispersion <- betadisper(x, group = biol)
output = dispersion$distances
}
output
}
nested.cleaning <- nested.cleaning %>% mutate (distances = map2(matrix, original_sample, dist_to_centroid))
unlist (nested.cleaning$distances) -> all_distances
#normparams <- fitdistr(all_pairwise_distances, "normal")$estimate
normparams <- MASS::fitdistr(all_distances, "normal")$estimate
# probs <- pnorm(all_pairwise_distances, normparams[1], normparams[2])
probs <- pnorm(all_distances, normparams[1], normparams[2])
outliers <- which(probs>0.95)
discard <-names (all_distances[outliers])
all_distances
LL_201703A.1 LL_201703A.2 LL_201703A.3 LL_201703B.1 LL_201703B.2 LL_201703B.3 LL_201703D.1 LL_201703D.2 LL_201703D.3
0.13625301 0.35880707 0.24481834 0.11786624 0.17375479 0.15871014 0.25237790 0.17664282 0.13972079
LL_201703E.1 LL_201703E.2 LL_201703E.3 LL_201703F.1 LL_201703F.2 LL_201703F.3 PO_201703A.1 PO_201703A.2 PO_201703A.3
0.18874104 0.08762146 0.12654161 0.28378689 0.17883503 0.07654715 0.16471911 0.33485159 0.07873209
PO_201703B.1 PO_201703B.2 PO_201703B.3 PO_201703C.1 PO_201703C.2 PO_201703C.3 PO_201703D.1 PO_201703D.2 PO_201703D.3
0.12452701 0.38124568 0.19686350 0.16156567 0.25456026 0.13474249 0.23816473 0.34690721 0.13841017
PO_201703E.1 PO_201703E.2 PO_201703E.3 PO_201703F.1 PO_201703F.2 PO_201703F.3 TW_201703A.1 TW_201703A.2 TW_201703A.3
0.13773561 0.30286292 0.08269833 0.13585790 0.33762093 0.16755540 0.03097821 0.41234310 0.24047281
TW_201703B.1 TW_201703B.2 TW_201703B.3 TW_201703C.1 TW_201703C.2 TW_201703C.3 TW_201703D.2 TW_201703D.3 TW_201703E.1
0.19596800 0.36860454 0.06147847 0.24031588 0.31569610 0.22246450 0.11192855 0.11192855 0.53705208
TW_201703E.2 TW_201703E.3 TW_201703F.1 TW_201703F.2 TW_201703F.3 LL_201707A.1 LL_201707A.2 LL_201707A.3 LL_201707B.1
0.13460879 0.13150055 0.09580705 0.32591372 0.24652571 0.19565941 0.40786283 0.12533363 0.19577566
LL_201707B.2 LL_201707B.3 LL_201707C.1 LL_201707C.2 LL_201707C.3 LL_201708A.1 LL_201708A.2 LL_201708A.3 LL_201708B.1
0.15587693 0.16513749 0.12837155 0.09044335 0.20694010 0.25080218 0.31451228 0.29089518 0.28849028
LL_201708B.2 LL_201708B.3 LL_201708C.1 LL_201708C.2 LL_201708C.3 PO_201707A.1 PO_201707A.2 PO_201707A.3 PO_201707B.1
0.27307191 0.30102499 0.17368227 0.23292150 0.16953950 0.16073332 0.31438009 0.25899910 0.25069954
PO_201707B.2 PO_201707B.3 PO_201707C.1 PO_201707C.2 PO_201707C.3 PO_201708A.1 PO_201708A.2 PO_201708A.3 PO_201708B.1
0.26790802 0.25799422 0.31519903 0.18404621 0.28670614 0.22407936 0.19145599 0.31770854 0.37691742
PO_201708B.2 PO_201708B.3 PO_201708C.1 PO_201708C.2 PO_201708C.3 SA_201707A.1 SA_201707A.2 SA_201707A.3 SA_201707B.1
0.32397896 0.31838533 0.20371194 0.29164108 0.22670963 0.26677486 0.16661059 0.12543549 0.28282472
SA_201707B.2 SA_201707B.3 SA_201707C.1 SA_201707C.2 SA_201707C.3 SA_201708A.1 SA_201708A.2 SA_201708A.3 SA_201708B.1
0.21057414 0.21492075 0.17424903 0.26579481 0.16401597 0.19144035 0.17536249 0.19687628 0.23920250
SA_201708B.2 SA_201708B.3 SA_201708C.1 SA_201708C.2 SA_201708C.3 TR_201707A.1 TR_201707A.2 TR_201707A.3 TR_201707B.1
0.28041200 0.20222634 0.16198006 0.18458623 0.16482170 0.38078410 0.24574496 0.18691928 0.23480187
TR_201707B.2 TR_201707B.3 TR_201707C.1 TR_201707C.2 TR_201707C.3 TR_201708A.1 TR_201708A.2 TR_201708A.3 TR_201708B.1
0.13953084 0.11437770 0.29745958 0.27987091 0.13166308 0.23142577 0.21463294 0.19578625 0.13510875
TR_201708B.2 TR_201708B.3 TR_201708C.1 TR_201708C.2 TR_201708C.3 TW_201707A.1 TW_201707A.2 TW_201707A.3 TW_201707B.1
0.36128237 0.19323695 0.11327835 0.12523905 0.13876746 0.09997909 0.18137468 0.13400328 0.09596006
TW_201707B.2 TW_201707B.3 TW_201707C.1 TW_201707C.2 TW_201707C.3 TW_201708A.1 TW_201708A.2 TW_201708A.3 TW_201708B.1
0.27326407 0.12207356 0.15080762 0.09547681 0.11968098 0.28083478 0.32789627 0.29770788 0.28716231
TW_201708B.2 TW_201708B.3 TW_201708C.1 TW_201708C.2 TW_201708C.3 LL_201705A.1 LL_201705A.2 LL_201705A.3 LL_201705B.1
0.31331817 0.17634853 0.18956755 0.32541748 0.24027209 0.12954176 0.48811052 0.42631204 0.30492711
LL_201705B.2 LL_201705B.3 LL_201705C.1 LL_201705C.2 LL_201705C.3 LL_201706A.1 LL_201706A.2 LL_201706A.3 LL_201706B.1
0.34941102 0.41831325 0.18253668 0.31187022 0.18513841 0.56530804 0.36202416 0.34475521 0.35531412
LL_201706B.2 LL_201706B.3 LL_201706C.1 LL_201706C.2 LL_201706C.3 PO_201705A.1 PO_201705A.2 PO_201705A.3 PO_201705B.1
0.37909891 0.37585944 0.48446426 0.38214867 0.37377514 0.08007319 0.43221260 0.24163768 0.19559611
PO_201705B.2 PO_201705B.3 PO_201705C.1 PO_201705C.2 PO_201705C.3 PO_201706A.1 PO_201706A.2 PO_201706A.3 PO_201706B.1
0.28648086 0.46859631 0.08816983 0.07915061 0.06754419 0.37558200 0.40250367 0.48139216 0.33057062
PO_201706B.2 PO_201706B.3 PO_201706C.1 PO_201706C.2 PO_201706C.3 SA_201705A.1 SA_201705A.2 SA_201705A.3 SA_201705B.1
0.46966150 0.45009175 0.23509942 0.28978770 0.34004489 0.30394183 0.29724296 0.31578554 0.28114732
SA_201705B.2 SA_201705B.3 SA_201705C.1 SA_201705C.2 SA_201705C.3 SA_201706A.1 SA_201706A.2 SA_201706A.3 SA_201706B.1
0.20616198 0.26757541 0.36679177 0.33242289 0.43883854 0.26551454 0.34030673 0.20479601 0.21376698
SA_201706B.2 SA_201706B.3 SA_201706C.1 SA_201706C.2 SA_201706C.3 TR_201705A.1 TR_201705A.2 TR_201705A.3 TR_201705B.1
0.24904907 0.25091453 0.43107378 0.35012786 0.40933099 0.18653370 0.21615233 0.32541795 0.30795000
TR_201705B.2 TR_201705B.3 TR_201705C.1 TR_201705C.2 TR_201705C.3 TR_201706A.1 TR_201706A.2 TR_201706A.3 TR_201706B.1
0.21313227 0.20223004 0.22978337 0.26355714 0.26880833 0.47653719 0.36284917 0.44234845 0.42696790
TR_201706B.2 TR_201706B.3 TR_201706C.1 TR_201706C.2 TR_201706C.3 TW_201705A.1 TW_201705A.2 TW_201705A.3 TW_201705B.1
0.37846865 0.42697207 0.37347632 0.41453446 0.45008975 0.34701836 0.39979918 0.38399138 0.41793660
TW_201705B.2 TW_201705B.3 TW_201705C.1 TW_201705C.2 TW_201705C.3 TW_201706A.1 TW_201706A.2 TW_201706B.1 TW_201706B.2
0.21959514 0.49431996 0.19049863 0.40741979 0.29835490 0.19308072 0.19308072 0.21292407 0.26251765
TW_201706B.3 TW_201706C.1 TW_201706C.2 TW_201706C.3 CP_201709A.1 CP_201709A.2 CP_201709A.3 CP_201709B.1 CP_201709B.2
0.34853135 0.42927899 0.37043719 0.54961074 0.28346536 0.26419108 0.38925021 0.31723262 0.32281930
CP_201709B.3 CP_201709C.1 CP_201709C.3 CP_201710A.1 CP_201710A.2 CP_201710A.3 CP_201710B.1 CP_201710B.2 CP_201710B.3
0.36571132 0.23905735 0.23905735 0.32726693 0.26619898 0.43612053 0.35641148 0.31747358 0.47663887
CP_201710C.1 CP_201710C.2 CP_201710C.3 CP_201711A.1 CP_201711A.2 CP_201711A.3 CP_201711B.1 CP_201711B.2 CP_201711B.3
0.47434020 0.40350614 0.22359320 0.51470154 0.34584551 0.34546043 0.33377101 0.36916055 0.39501211
CP_201711C.1 CP_201711C.2 CP_201711C.3 FH_201709A.1 FH_201709A.2 FH_201709A.3 FH_201709B.1 FH_201709B.2 FH_201709B.3
0.40595165 0.33673637 0.48307825 0.33644332 0.35209724 0.26551978 0.41035606 0.30702749 0.32067942
FH_201709C.1 FH_201709C.2 FH_201709C.3 FH_201710A.1 FH_201710A.2 FH_201710A.3 FH_201710B.1 FH_201710B.2 FH_201710B.3
0.26647980 0.19417930 0.76447430 0.26813546 0.29281847 0.32829062 0.27233127 0.37126246 0.41221457
FH_201710C.1 FH_201710C.2 FH_201710C.3 FH_201711A.1 FH_201711A.2 FH_201711A.3 FH_201711B.1 FH_201711B.2 FH_201711B.3
0.17129342 0.20285746 0.29600655 0.34155977 0.33063567 0.30661559 0.43475539 0.24497509 0.36555637
FH_201711C.1 FH_201711C.2 FH_201711C.3 LK_201709A.1 LK_201709A.2 LK_201709A.3 LK_201709B.1 LK_201709B.2 LK_201709B.3
0.33712471 0.35624770 0.23467041 0.35645449 0.35413460 0.33921717 0.32551911 0.34602147 0.52233141
LK_201709C.1 LK_201709C.2 LK_201709C.3 LK_201710A.2 LK_201710A.3 LK_201710B.1 LK_201710B.2 LK_201710B.3 LK_201710C.1
0.35532169 0.34811992 0.35041892 0.27608950 0.27608950 0.22730033 0.32654848 0.43138500 0.27753417
LK_201710C.2 LK_201710C.3 LK_201711A.2 LK_201711A.3 LK_201711B.1 LK_201711B.2 LK_201711B.3 LK_201711C.1 LK_201711C.2
0.29766125 0.36382022 0.24666728 0.24666728 0.46879536 0.45024465 0.45525376 0.28981385 0.43091660
LK_201711C.3 CP_201706A.1 CP_201706A.2 CP_201706A.3 CP_201706B.1 CP_201706B.2 CP_201706B.3 CP_201706C.1 CP_201706C.2
0.32064041 0.47394873 0.46482096 0.51916585 0.39031387 0.37969046 0.42702409 0.39836255 0.45349341
CP_201706C.3 CP_201707A.1 CP_201707A.2 CP_201707A.3 CP_201707B.1 CP_201707B.2 CP_201707B.3 CP_201707C.1 CP_201707C.2
0.51752685 0.31394900 0.22743413 0.44972476 0.45832210 0.39398824 0.46587899 0.37458273 0.45897945
CP_201707C.3 CP_201708A.1 CP_201708A.2 CP_201708A.3 CP_201708B.1 CP_201708B.2 CP_201708B.3 CP_201708C.1 CP_201708C.2
0.52780300 0.30225251 0.38002402 0.28869795 0.47252551 0.24970573 0.22972008 0.16113411 0.21307242
CP_201708C.3 FH_201706A.1 FH_201706A.2 FH_201706A.3 FH_201706B.1 FH_201706B.2 FH_201706B.3 FH_201706C.1 FH_201706C.2
0.33747259 0.46184806 0.19971283 0.39732658 0.25924218 0.27109590 0.41901540 0.44738877 0.20884756
FH_201706C.3 FH_201707A.1 FH_201707A.2 FH_201707A.3 FH_201707B.1 FH_201707B.2 FH_201707B.3 FH_201707C.1 FH_201707C.2
0.42540736 0.35776616 0.30399299 0.32354568 0.23142426 0.32471335 0.38370174 0.39951644 0.21756462
FH_201707C.3 FH_201708A.1 FH_201708A.2 FH_201708A.3 FH_201708B.1 FH_201708B.2 FH_201708B.3 FH_201708C.1 FH_201708C.2
0.30328849 0.37888811 0.37000684 0.39572162 0.33982587 0.41549391 0.42939253 0.38569780 0.49341359
FH_201708C.3 LK_201706A.1 LK_201706A.2 LK_201706A.3 LK_201706B.1 LK_201706B.2 LK_201706B.3 LK_201706C.1 LK_201706C.2
0.33215175 0.48636473 0.30835058 0.44539395 0.48361955 0.46553149 0.50700960 0.35910315 0.42315450
LK_201706C.3 LK_201707A.1 LK_201707A.2 LK_201707A.3 LK_201707B.1 LK_201707B.2 LK_201707B.3 LK_201707C.1 LK_201707C.2
0.51145593 0.47484641 0.36616877 0.41655854 0.36136712 0.32792126 0.36431089 0.34199257 0.31924573
LK_201707C.3 LK_201708A.1 LK_201708A.2 LK_201708A.3 LK_201708B.1 LK_201708B.2 LK_201708B.3 LK_201708C.1 LK_201708C.2
0.46478331 0.43810258 0.48133497 0.55217919 0.29875879 0.32826587 0.50391979 0.28446651 0.28131990
LK_201708C.3 LL_201709A.1 LL_201709A.2 LL_201709A.3 LL_201709B.1 LL_201709B.2 LL_201709B.3 LL_201709C.1 LL_201709C.2
0.32600456 0.40867064 0.34286639 0.29105243 0.38507559 0.40716639 0.28675735 0.39925843 0.38629920
LL_201709C.3 LL_201711B.1 LL_201711B.2 LL_201711B.3 LL_201711C.1 LL_201711C.2 LL_201711C.3 PO_201709A.1 PO_201709A.2
0.32997238 0.35007347 0.36504310 0.53671177 0.32809664 0.57652762 0.38745339 0.43647117 0.31838672
PO_201709A.3 PO_201709B.1 PO_201709B.2 PO_201709B.3 PO_201709C.1 PO_201709C.2 PO_201709C.3 PO_201711A.1 PO_201711A.2
0.34159786 0.32463241 0.32705121 0.35063631 0.38354254 0.41703252 0.36869710 0.23649847 0.31997618
PO_201711A.3 PO_201711B.1 PO_201711B.2 PO_201711B.3 PO_201711C.1 PO_201711C.2 PO_201711C.3 SA_201703A.1 SA_201703A.2
0.35557277 0.28930790 0.56591479 0.27468177 0.34120491 0.40152095 0.36530179 0.22415599 0.37086495
SA_201703A.3 SA_201703B.1 SA_201703B.2 SA_201703B.3 SA_201703C.1 SA_201703C.2 SA_201703C.3 SA_201709A.1 SA_201709A.2
0.30826070 0.27558471 0.28427322 0.32591160 0.23167323 0.27031755 0.28827965 0.34400676 0.78225573
SA_201709A.3 SA_201709B.1 SA_201709B.2 SA_201709B.3 SA_201709C.1 SA_201709C.2 SA_201709C.3 SA_201711A.1 SA_201711A.2
0.34428146 0.51288916 0.34782554 0.26759727 0.30495660 0.38470237 0.39976843 0.29197506 0.25564084
SA_201711A.3 SA_201711C.1 SA_201711C.2 SA_201711C.3 TR_201703A.1 TR_201703A.2 TR_201703A.3 TR_201703B.1 TR_201703B.2
0.69659749 0.40263503 0.36079066 0.42258449 0.28690118 0.30483232 0.36887260 0.27470423 0.21341219
TR_201703B.3 TR_201703C.1 TR_201703C.2 TR_201703C.3 TR_201709A.1 TR_201709A.2 TR_201709A.3 TR_201709B.1 TR_201709B.2
0.32103410 0.32434053 0.30302932 0.20319089 0.42016114 0.38673967 0.24437521 0.39704547 0.27150160
TR_201709B.3 TR_201709C.1 TR_201709C.2 TR_201709C.3 TR_201711A.1 TR_201711A.2 TR_201711A.3 TR_201711C.1 TR_201711C.2
0.20033662 0.40464089 0.35630032 0.32446302 0.37615162 0.44181006 0.56579542 0.42424767 0.39447173
TR_201711C.3 TW_201709A.1 TW_201709A.2 TW_201709A.3 TW_201709B.1 TW_201709B.2 TW_201709B.3 TW_201709C.1 TW_201709C.2
0.49610701 0.40776606 0.36526207 0.43733658 0.29452128 0.34268971 0.36638265 0.35102559 0.32632103
TW_201709C.3 TW_201711A.1 TW_201711A.2 TW_201711A.3 TW_201711B.1 TW_201711B.2 TW_201711B.3 TW_201711C.1 TW_201711C.2
0.24594621 0.42210920 0.56421051 0.42146032 0.27757926 0.36819359 0.38233706 0.22374190 0.39372018
TW_201711C.3 LL_201710B.1 LL_201710B.2 LL_201710B.3 LL_201710C.1 LL_201710C.2 LL_201801A.1 LL_201801A.2 LL_201801A.3
0.36200478 0.45601005 0.37219419 0.52013086 0.40010829 0.40010829 0.47483609 0.38875689 0.34757231
LL_201801B.1 LL_201801B.2 LL_201801B.3 LL_201801C.1 LL_201801C.2 LL_201801C.3 LL_201803A.1 LL_201803A.2 LL_201803A.3
0.44365904 0.42948559 0.39678749 0.48964812 0.38989510 0.40033187 0.39725083 0.45922082 0.51017170
LL_201803B.1 LL_201803B.2 LL_201803B.3 LL_201803C.1 LL_201803C.2 LL_201803C.3 LL_201805A.1 LL_201805A.2 LL_201805A.3
0.52788641 0.38470084 0.41359849 0.41798497 0.43011507 0.39597526 0.27057211 0.44704785 0.36526226
LL_201805B.1 LL_201805B.2 LL_201805B.3 LL_201805C.1 LL_201805C.2 LL_201805C.3 LL_201807A.1 LL_201807A.2 LL_201807A.3
0.31968451 0.40926363 0.29023037 0.23004378 0.40913361 0.27166182 0.35498869 0.33911301 0.37514140
LL_201807B.1 LL_201807B.2 LL_201807B.3 LL_201807C.1 LL_201807C.2 LL_201807C.3 PO_201801A.1 PO_201801A.2 PO_201801A.3
0.38492638 0.44617453 0.40303904 0.39521449 0.36847651 0.35585727 0.41376244 0.43980797 0.46446040
PO_201801B.1 PO_201801B.2 PO_201801B.3 PO_201801C.1 PO_201801C.2 PO_201801C.3 PO_201803A.1 PO_201803A.2 PO_201803A.3
0.44717274 0.50609542 0.41916872 0.37395555 0.35126585 0.35056166 0.33937889 0.66503754 0.31771629
PO_201803B.1 PO_201803B.2 PO_201803B.3 PO_201803C.1 PO_201803C.2 PO_201803C.3 PO_201805A.1 PO_201805A.2 PO_201805A.3
0.39921752 0.50327269 0.29846570 0.26958102 0.36846599 0.32504740 0.20787672 0.29576918 0.33659144
PO_201805B.1 PO_201805B.2 PO_201805B.3 PO_201805C.1 PO_201805C.3 PO_201807A.1 PO_201807A.3 PO_201807B.1 PO_201807B.2
0.15277957 0.40404185 0.17131626 0.21529400 0.21529400 0.26729890 0.26729890 0.26230501 0.27780666
PO_201807B.3 PO_201807C.1 PO_201807C.3 SA_201801A.1 SA_201801A.2 SA_201801A.3 SA_201801B.1 SA_201801B.2 SA_201801B.3
0.24149798 0.29435501 0.29435501 0.34322437 0.38055694 0.37830349 0.33384709 0.35019003 0.32689399
SA_201801C.1 SA_201801C.2 SA_201801C.3 SA_201803A.1 SA_201803A.2 SA_201803A.3 SA_201803B.1 SA_201803B.2 SA_201803B.3
0.41283419 0.35481154 0.39044460 0.31680074 0.36624647 0.29069167 0.46765723 0.29346356 0.27191597
SA_201803C.1 SA_201803C.2 SA_201803C.3 SA_201805A.1 SA_201805A.2 SA_201805A.3 SA_201805B.1 SA_201805B.2 SA_201805B.3
0.28212755 0.30154490 0.25190997 0.35201148 0.31093295 0.23391785 0.25581609 0.35125467 0.24293564
SA_201805C.1 SA_201805C.2 SA_201805C.3 SA_201807A.2 SA_201807A.3 SA_201807B.1 SA_201807B.2 SA_201807B.3 SA_201807C.1
0.18367954 0.36957058 0.20466409 0.26160759 0.26160759 0.43231385 0.34158907 0.32004285 0.43497304
SA_201807C.2 SA_201807C.3 TR_201801A.1 TR_201801A.2 TR_201801A.3 TR_201801B.1 TR_201801B.2 TR_201801B.3 TR_201801C.1
0.33807401 0.36963061 0.43585312 0.35803229 0.33323115 0.43113413 0.47746119 0.35498717 0.40107881
TR_201801C.2 TR_201801C.3 TR_201803A.1 TR_201803A.2 TR_201803A.3 TR_201803B.1 TR_201803B.2 TR_201803B.3 TR_201803C.1
0.34318877 0.44774547 0.37630648 0.26172496 0.46540925 0.34132203 0.30491044 0.24726576 0.31232700
TR_201803C.2 TR_201803C.3 TR_201805A.1 TR_201805A.2 TR_201805A.3 TR_201805B.1 TR_201805B.2 TR_201805B.3 TR_201805C.1
0.36634969 0.33088446 0.29852685 0.45471562 0.42680875 0.26015141 0.22231166 0.31442933 0.27020586
TR_201805C.2 TR_201805C.3 TR_201807A.1 TR_201807A.2 TR_201807A.3 TR_201807B.1 TR_201807B.2 TR_201807B.3 TR_201807C.1
0.51255521 0.25408872 0.40951576 0.34973849 0.30016724 0.31013691 0.39837080 0.27234644 0.36667909
TR_201807C.2 TR_201807C.3 TW_201801A.1 TW_201801A.2 TW_201801A.3 TW_201801B.1 TW_201801B.2 TW_201801B.3 TW_201801C.1
0.26485424 0.28847145 0.26409554 0.41165353 0.31252466 0.42396978 0.39415653 0.35929665 0.25495471
TW_201801C.2 TW_201801C.3 TW_201803A.1 TW_201803A.2 TW_201803A.3 TW_201803B.1 TW_201803B.2 TW_201803B.3 TW_201803C.1
0.39970791 0.23136153 0.34567386 0.39890928 0.29058667 0.29981460 0.30015425 0.31707123 0.31838539
TW_201803C.2 TW_201803C.3 TW_201805A.1 TW_201805A.2 TW_201805A.3 TW_201805B.1 TW_201805B.2 TW_201805B.3 TW_201805C.1
0.19972135 0.21277101 0.36535580 0.39694080 0.43662486 0.45913073 0.44492963 0.45837991 0.51992699
TW_201805C.2 TW_201805C.3 TW_201807A.1 TW_201807A.2 TW_201807A.3 TW_201807B.1 TW_201807B.2 TW_201807B.3 TW_201807C.1
0.19071669 0.17632152 0.25486977 0.26029474 0.40987919 0.25639986 0.16238746 0.25383933 0.28148478
TW_201807C.3 CP_201801A.1 CP_201801A.2 CP_201801A.3 CP_201801B.1 CP_201801B.2 CP_201801B.3 CP_201801C.1 CP_201801C.2
0.28148478 0.46838362 0.60798155 0.42381648 0.54719376 0.44557946 0.33814904 0.48226179 0.42301806
CP_201801C.3 CP_201803A.1 CP_201803A.2 CP_201803A.3 CP_201803B.1 CP_201803B.2 CP_201803B.3 CP_201803C.1 CP_201803C.2
0.52646432 0.37547683 0.42643411 0.40372276 0.43293759 0.29018671 0.27049297 0.42976560 0.32824542
CP_201803C.3 CP_201805A.1 CP_201805A.2 CP_201805B.1 CP_201805B.2 CP_201805B.3 CP_201805C.1 CP_201805C.2 CP_201805C.3
0.30304161 0.41508044 0.41508044 0.51521132 0.44276395 0.38169397 0.52261528 0.37612748 0.39763700
CP_201807A.1 CP_201807A.2 CP_201807A.3 CP_201807B.1 CP_201807B.2 CP_201807B.3 CP_201807C.1 CP_201807C.2 CP_201807C.3
0.36703748 0.34679158 0.31771078 0.33890308 0.35776184 0.37889889 0.23776643 0.21772717 0.22228416
FH_201801A.1 FH_201801A.2 FH_201801A.3 FH_201801B.1 FH_201801B.2 FH_201801C.1 FH_201801C.2 FH_201801C.3 FH_201803A.1
0.22170096 0.29616715 0.15430326 0.24672380 0.24672380 0.55894561 0.27000833 0.35200297 0.38247998
FH_201803A.2 FH_201803A.3 FH_201803B.1 FH_201803B.2 FH_201803B.3 FH_201803C.1 FH_201803C.2 FH_201803C.3 FH_201805A.1
0.35944097 0.25766540 0.29907689 0.31358642 0.34990214 0.41459343 0.39403951 0.44417043 0.41581585
FH_201805A.2 FH_201805A.3 FH_201805B.1 FH_201805B.2 FH_201805B.3 FH_201805C.1 FH_201805C.2 FH_201805C.3 FH_201807A.1
0.37400734 0.44746775 0.43024264 0.41308045 0.42925367 0.48615993 0.38203457 0.42145758 0.28473096
FH_201807A.2 FH_201807A.3 FH_201807B.1 FH_201807B.2 FH_201807B.3 FH_201807C.1 FH_201807C.2 FH_201807C.3 LK_201801A.1
0.23057869 0.26977849 0.23104736 0.24195078 0.20292586 0.23980991 0.23118261 0.22521523 0.47988831
LK_201801A.2 LK_201801A.3 LK_201801B.1 LK_201801B.2 LK_201801B.3 LK_201803A.1 LK_201803A.2 LK_201803A.3 LK_201803B.1
0.45300945 0.46873130 0.43226538 0.49581928 0.32645683 0.49255834 0.32651039 0.38882879 0.43339563
LK_201803B.2 LK_201803B.3 LK_201803C.1 LK_201803C.2 LK_201803C.3 LK_201805A.1 LK_201805A.2 LK_201805A.3 LK_201805B.1
0.31307940 0.38094595 0.31812828 0.36584087 0.23524932 0.53950018 0.48575681 0.43766863 0.51803510
LK_201805B.2 LK_201805B.3 LK_201805C.1 LK_201805C.2 LK_201805C.3 LK_201807A.1 LK_201807A.2 LK_201807A.3 LK_201807B.1
0.36246334 0.48127971 0.44164127 0.47156873 0.38190732 0.28425998 0.24809516 0.27360448 0.41958509
LK_201807B.2 LK_201807B.3 LK_201807C.1 LK_201807C.2 LK_201807C.3
0.34999948 0.34744268 0.22149067 0.24829432 0.30292554
to_write_discarded <- as.tibble(all_distances[outliers]) %>% rownames_to_column("sample") %>% dplyr::select(sample,
distance_to_centroid = value)
to_write_discarded <- to_write_discarded %>% bind_rows(tibble(sample = discard.1,
distance_to_centroid = NA))
write_csv(to_write_discarded ,"discared_samples.csv")
# Who passes this filter
all_distances %>%
as.tibble() %>%
mutate(sample = names(all_distances)) %>%
filter(!sample %in% to_write_discarded$sample) %>%
separate(sample, into = "event", sep = -3) %>%
group_by(event) %>%
summarise(cases = n()) %>%
separate(event, into = c("Site", "Date"), remove = F) %>%
filter(Date != "201703") %>%
ggplot()+
geom_raster(aes(x= Date, y = Site, fill = cases))+
geom_text(aes(x= Date, y = Site, label = cases))
NA
NA
NA
Finally, let’s remove these samples from the dataset
ASV.nested %>%
mutate(Step4.tibble = map (Step3.tibble, ~ filter(.,! sample %in% to_write_discarded$sample))) -> ASV.nested
ASV.nested %>%
transmute(Miseq_run, Summary.1 = map(Step4.tibble, ~ how.many(ASVtable = .,round = "5.PCR.dissimilarity"))) %>%
left_join(ASV.summary) %>%
mutate(Summary = map2(Summary, Summary.1, bind_rows)) %>%
dplyr::select(-Summary.1) -> ASV.summary
Joining, by = "Miseq_run"
We will export the final cleaned table with four columns (Miseq_run, sample, Hash, nReads)
input <- read_csv("../Output/Hash_Key_all_together.csv")
Parsed with column specification:
cols(
Hash = [31mcol_character()[39m,
Sequence = [31mcol_character()[39m
)
ASV.summary %>%
unnest() %>%
ggplot(aes(x=Stage, y=number, fill = Stat))+
geom_line(aes(group = Miseq_run, color = Miseq_run))+
facet_grid(Stat~., scales = "free")+
theme(axis.text.x = element_text(angle = 45, hjust =1))#,